Support for Qwen-Image, Qwen-Image-Edit, and Qwen-Image-Edit-Plus#2072

Merged
Acly merged 27 commits intoAcly:mainfrom
Danamir:qwen-image
Oct 8, 2025
Conversation

@Danamir (Contributor) commented Oct 5, 2025

Basic support for Qwen-Image models, including Nunchaku SVDQ quantized versions.

I did not touch the models and node auto-installation part, as I'm not familiar with it. For now you can try this PR as long as you have a Qwen-Image model downloaded (normal, GGUF, or SVDQ) and recent versions of ComfyUI and ComfyUI-nunchaku.

You can also load a Lightning LoRA from https://huggingface.co/lightx2v/Qwen-Image-Lightning .

When using Lightning versions I recommend creating specific presets with minimum_steps set to 1. For example:

    "ER SDE - BETA (lightning)": {
        "sampler": "er_sde",
        "scheduler": "beta",
        "steps": 4,
        "cfg": 1.0,
        "minimum_steps": 1
    }

NB: I'll try to add Qwen-Image-Edit support, and maybe basic Qwen-Image-Edit-Plus (i.e. 2509) support with only one layer. For the latter this will heavily limit the possibilities offered by the model, but I'd rather have a separate PR for the UI modifications needed to handle multiple edit sources.

  • Qwen architecture definitions: Arch.qwen, Arch.qwen_e, Arch.qwen_e_p
  • VAE
  • Text encoder
  • Qwen-Image checkpoint & GGUF, Nunchaku-SVDQ
  • Qwen-Image-Edit checkpoint & GGUF, Nunchaku-SVDQ
    • TextEncodeQwenImageEdit + ReferenceLatent
  • Qwen-Image-Edit-Plus checkpoint & GGUF, Nunchaku-SVDQ
    • TextEncodeQwenImageEditPlus + chained ReferenceLatent nodes
  • UI Edit mode + add references
    • Bug on UI when using Qwen-Image and linked Edit model (will be fixed later)
  • Auto CPU offloading

@Danamir (Contributor Author) commented Oct 5, 2025

This should fix #1939, #2032, #2066.

@Danamir (Contributor Author) commented Oct 5, 2025

Basic support for Qwen-Image-Edit done, with TextEncodeQwenImageEdit (not the Plus version). I purposefully left the VAE input empty to force the use of ReferenceLatent, as it fixes the unzoom edit bug.

@Danamir (Contributor Author) commented Oct 5, 2025

I hard-coded the settings that work for my GPU for NunchakuQwenImageDiTLoader, but they should probably be left to the user, maybe in the performance settings?

model = w.nunchaku_load_qwen_diffusion_model(
    model_info.filename,
    cpu_offload="enable",
    num_blocks_on_gpu=16,
    use_pin_memory="disable",
)

@Acly (Owner) commented Oct 5, 2025

I got an error here if I don't override the checkpoint resolution in the style. The resolution range can probably be flexible like Flux, although I notice the Edit model doesn't follow instructions properly when the resolution isn't ~1MP.

Apart from that, it works well!

Basic support for Qwen-Image-Edit done, with TextEncodeQwenImageEdit (not the Plus version).

Is there any downside from using the Plus version?

but I'd rather have a separate PR for the UI modifications needed to handle multiple edit sources.

You just create some "Reference" control layers for more image sources. The UI side of this already works, and multiple image inputs are handled in apply_edit_conditioning by stitching them together for Flux Kontext. It probably works for Qwen too, but maybe it can be improved with the TextEncodeQwenImageEditPlus multi-input node.

I purposefully left the VAE input empty to force the use of ReferenceLatent, as it fixes the unzoom edit bug.

It's still unclear to me if this is really a bug, or an issue with input resolutions. Or what exactly the difference is between passing the image to the encode node vs. using ReferenceLatent...

I hard-coded the settings that work for my GPU for NunchakuQwenImageDiTLoader, but they should probably be left to the user, maybe in the performance settings?

I don't think this should be the burden of the user, unless as a last resort. It's a bit annoying that the nunchaku nodes don't figure this out themselves; they have the most information. For the Flux loader there was at least a heuristic.

@Danamir (Contributor Author) commented Oct 5, 2025

I got an error here if I don't override the checkpoint resolution in the style. The resolution range can probably be flexible like Flux, although I notice the Edit model doesn't follow instructions properly when the resolution isn't ~1MP.

You're right, I always had a custom resolution in my styles. I'll add a valid resolution range.
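A hypothetical sketch of such a resolution range check (the function name and the bounds are my assumptions, not values from the PR): sizes inside a supported megapixel window pass through unchanged, and anything outside is rescaled toward ~1MP while keeping aspect ratio and rounding to multiples of 8:

```python
import math

def clamp_to_megapixel_range(width: int, height: int,
                             min_mp: float = 0.25, max_mp: float = 2.0) -> tuple[int, int]:
    """If total pixels fall outside [min_mp, max_mp] megapixels, rescale
    toward 1MP while preserving aspect ratio, rounded to multiples of 8.
    The bounds are illustrative guesses, not the PR's actual values."""
    mp = width * height / 1_000_000
    if min_mp <= mp <= max_mp:
        return width, height
    scale = math.sqrt(1_000_000 / (width * height))
    return (round(width * scale / 8) * 8, round(height * scale / 8) * 8)
```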

Is there any downside from using the Plus version?

Sadly the Plus version is pretty bad at transforming styles (e.g. to pixel art, drawing, etc.). It is better at everything else though. And the 1MP transform seems to be handled directly by the text encode Plus node.

Right now I'm in the middle of adding a new architecture Arch.qwen_e_p and a property Arch.is_qwen_like. 😅 It may be overkill...

You just create some "Reference" control layers for more image sources.

I just saw that when copying the Flux code. I'll use that for TextEncodeQwenImageEditPlus, it should work great! I'll have to find a way to limit the additional sources to 3.

It's still unclear to me if this is really a bug, or an issue with input resolutions.

I think it's a little bit of both, but there is no real downside to using ReferenceLatent; it gives different results, but those are rarely worse.

Concerning Qwen-Image-Edit-Plus, I find it even more prone to zooming out, and ReferenceLatent can't be used. There were some formulas on Reddit trying to work out the internal resizing of TextEncodeQwenImageEditPlus, but it was really strange, like multiples of 16 but with an offset of 7-10 to the top left...

I don't think this should be the burden of the user, unless as a last resort.

We could simply detect the amount of VRAM and, below 16 GB, set cpu_offload=enable, num_blocks_on_gpu=1, use_pin_memory=disable; otherwise disable CPU offloading.

The num_blocks_on_gpu can be left at 1; performance stays the same, and the only change when raising it is that more of the model stays on the GPU, keeping some of your system RAM available.
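A minimal sketch of that heuristic (the function name is mine; in practice the VRAM amount would come from something like torch.cuda.get_device_properties(0).total_memory):

```python
def nunchaku_offload_settings(vram_gb: float) -> dict:
    """Hypothetical heuristic for NunchakuQwenImageDiTLoader options:
    below 16 GB of VRAM, enable CPU offloading with a single block kept
    on the GPU and pinned memory disabled; otherwise disable offloading."""
    if vram_gb < 16:
        return {
            "cpu_offload": "enable",
            "num_blocks_on_gpu": 1,
            "use_pin_memory": "disable",
        }
    return {"cpu_offload": "disable"}
```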

def from_string(s: str, filename: str | None = None):
    if s == "svdq":
        return Quantization.svdq
    elif filename and "qwen" in filename and "svdq" in filename:
@Acly (Owner):

Why was this needed? Was there a svdq file that wasn't detected?

@Danamir (Contributor Author):

I added some logs to find what the quantization value was, and nothing is returned in the model entity! I had to resort to the filename trick...

I'll try some more debugging.

@Acly (Owner):

Ah you're right, the quant field wasn't set by the model detection

Fixed here: Acly/comfyui-tooling-nodes@f555efb

@Acly (Owner) commented Oct 5, 2025

Right now I'm in the middle of adding a new architecture Arch.qwen_e_p and a property Arch.is_qwen_like. 😅 It may be overkill...

Yea, I think Arch should really be for different architectures, or at least "model ecosystems". I'm actually not even sure if Flux Kontext and Qwen Edit deserve their own Arch (edit models are technically just finetunes), but there are lots of things that don't work for edit models, so it makes some sense.

@Danamir (Contributor Author) commented Oct 5, 2025

In the meantime I added the Arch.qwen_e_p architecture support. We can always refactor it with a cleaner solution.

It works like a charm with a single image. I just have to solve the problem preventing Qwen-Edit from adding reference images, since it has neither controlnets nor ipadapter.


[edit]: found it, this should do the trick:

if self.mode.is_ip_adapter and models.arch in (Arch.flux_k, Arch.qwen_e_p):
    is_supported = True  # Reference images are merged into the conditioning context

@Danamir (Contributor Author) commented Oct 5, 2025

Here we go! Qwen-Image-Edit-Plus support. It even works with selections, which makes it a pretty useful inpaint model.

[screenshots: ComfyUI_00180_, ComfyUI_00377_]

@Danamir (Contributor Author) commented Oct 5, 2025

Still a little bit of a zoom problem (the top of the head is cropped), but still pretty good!

"Extract the man in image1, use the poses from image2. White background, concept art."


@Danamir Danamir changed the title Qwen image basic support Support for Qwen-Image, Qwen-Image-Edit, and Qwen-Image-Edit-Plus Oct 5, 2025
@shadowlocked commented:
This is fabulous work - I have been using Qwen on ComfyUI for the first time while waiting for it to hit Krita Diffusion. The only thing I would say is to make sure that the (locked) new Qwen default doesn't have a Nunchaku dependency. After a recent update, I could no longer get the default locked Kontext setting to work, as I didn't have Nunchaku installed and just got an error.

The problem there was that in the 'default' Kontext style, 'Diffusion Architecture' was set to 'Automatic', and of course that cannot be changed by the user. 'Automatic' turned out to have Nunchaku requirements(!). So I just copied the locked default and changed the Diffusion Architecture to 'Flux Kontext'.

@Danamir (Contributor Author) commented Oct 5, 2025

make sure that the (locked) new Qwen default doesn't have a Nunchaku dependency.

The PR has been tested with GGUF and Nunchaku versions. I do not have a non-quantized Qwen version, nor do I have a ComfyUI installation without the nunchaku package.

Feel free to give it a try; theoretically there should be no problem without it.

@Danamir (Contributor Author) commented Oct 5, 2025

Here is the code of the internal resize in ComfyUI's TextEncodeQwenImageEditPlus:

samples = image.movedim(-1, 1)
total = int(384 * 384)

scale_by = math.sqrt(total / (samples.shape[3] * samples.shape[2]))
width = round(samples.shape[3] * scale_by)
height = round(samples.shape[2] * scale_by)

s = comfy.utils.common_upscale(samples, width, height, "area", "disabled")
images_vl.append(s.movedim(1, -1))
if vae is not None:
    total = int(1024 * 1024)
    scale_by = math.sqrt(total / (samples.shape[3] * samples.shape[2]))
    width = round(samples.shape[3] * scale_by / 8.0) * 8
    height = round(samples.shape[2] * scale_by / 8.0) * 8

    s = comfy.utils.common_upscale(samples, width, height, "area", "disabled")
    ref_latents.append(vae.encode(s.movedim(1, -1)[:, :, :, :3]))

It looks like a simple fixed-ratio resize to 1048576 total pixels (i.e. 1024x1024), then rounded to a multiple of 8. There is some strange stuff with movedim(-1, 1), but I'm not familiar with that method.
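The dimension math in that snippet can be reproduced standalone; a small sketch of just the size computation (the function name is mine, not ComfyUI's):

```python
import math

def edit_plus_sizes(width: int, height: int) -> tuple[tuple[int, int], tuple[int, int]]:
    """Compute the two target sizes used by TextEncodeQwenImageEditPlus:
    a ~384*384-total-pixel image for the vision-language prompt, and a
    ~1024*1024-total-pixel image (rounded to multiples of 8) for the VAE.
    Both scalings keep the aspect ratio."""
    vl_scale = math.sqrt(384 * 384 / (width * height))
    vl = (round(width * vl_scale), round(height * vl_scale))
    vae_scale = math.sqrt(1024 * 1024 / (width * height))
    vae = (round(width * vae_scale / 8.0) * 8, round(height * vae_scale / 8.0) * 8)
    return vl, vae
```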

@Acly (Owner) commented Oct 5, 2025

It looks like a simple fixed-ratio resize to 1048576 total pixels (i.e. 1024x1024), then rounded to a multiple of 8. There is some strange stuff with movedim(-1, 1), but I'm not familiar with that method.

Yea. It resizes all images to (384*384) total pixels while keeping aspect ratio and attaches that to the LLM prompt.

If VAE is provided it also resizes all images to (1024*1024) total pixels while keeping aspect ratio and attaches them to conditioning as if by using the Reference Latent node.

The movedim is just to convert between the file/display image layout and the inference memory layout.

The question is whether the automatic resize is good or bad. For Flux Kontext I omitted it. Although it follows prompts better when images stay around 1MP, it can really degrade quality. Probably it's similar here. If I understood you correctly, the internal resize makes the "zoom in" behavior more likely.

In any case we should pass the images, and don't need to pass the VAE, since we already handle ReferenceLatent anyway.

@Danamir (Contributor Author) commented Oct 5, 2025

In any case we should pass the images, and don't need to pass the VAE, since we already handle ReferenceLatent anyway.

The problem is that ReferenceLatent cannot be used reliably with multiple images like TextEncodeQwenImageEditPlus does. Every time I tried to combine those two nodes with more than one image, I did not get a coherent edit. If you only stitch the images together, you get worse results.

I was thinking of something more along the lines of resizing the expected latent output to the same size as the first internal resize, to see if it helps with the drifting.

It's too bad the internal scale cannot be avoided. Maybe I should try some tests with a modified copy of the node...

[edit]: Note that I did not try to chain multiple ReferenceLatent nodes, each with one image. Performance was already greatly affected by only one node.

[edit2]: I inspected the code of ReferenceLatent; it seems to do the exact same thing as the code used in TextEncodeQwenImageEditPlus. It seems you're right: we should be able to ignore the VAE and chain the ReferenceLatent nodes. I'll do some more tests in a custom workflow.

@Danamir (Contributor Author) commented Oct 5, 2025

Yes! By completely ignoring the resize we get a pixel-perfect edit, even when using 1856x1440 and 1536x1536 reference images. Performance was pretty bad because of the size, but the output is really good.

It seems to work even better if the reference latents are resized to share the same short-side dimension.

I'll update the code.
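A possible sketch of that short-side normalization (assuming plain proportional scaling with latent-friendly rounding; the function name is mine and this is not the actual code from the PR):

```python
def match_short_sides(sizes: list[tuple[int, int]], multiple: int = 8) -> list[tuple[int, int]]:
    """Rescale each (width, height) so all images share the smallest
    short side among the inputs, rounding to a latent-friendly multiple."""
    target = min(min(w, h) for w, h in sizes)
    out = []
    for w, h in sizes:
        scale = target / min(w, h)
        out.append((round(w * scale / multiple) * multiple,
                    round(h * scale / multiple) * multiple))
    return out
```

For example, a 1856x1440 and a 1536x1536 reference would both end up with a short side of 1440.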

@Danamir (Contributor Author) commented Oct 6, 2025

Got a strange bug when using a Qwen-Image model with a linked Edit model in the style.

When I try to add a reference image in Generate mode, it rightly tells me that it's not supported.

But if I switch to Edit mode, the reference disappears, and clicking the + button only adds more erroneous references to the Generate side.


I'll try and see if I can figure this out, but it's not high on the priority list for the core functionality.

@Danamir (Contributor Author) commented Oct 6, 2025

Using NunchakuQwenImageDiTLoader.cpu_offload="auto" might be enough for most usages. It automatically turns on CPU offloading with less than 16 GB of VRAM. The rest of the options are set to the node's default values.

@Acly (Owner) commented Oct 6, 2025

But if I switch to Edit mode, the reference disappears, and clicking the + button only adds more erroneous references to the Generate side.

Yea that's a bug, the buttons are always linked to the image model regions. I can fix it later.

@Acly (Owner) commented Oct 6, 2025

Question: do we really need to support both edit model variants?

From what I understood, 2509 is an evolution and may be further improved. Even if it doesn't do everything 100% better I'd rather not support the old version if it's almost obsolete already, and will probably be more so in the future.

@Acly (Owner) commented Oct 6, 2025

Another question: which models support LoRA? Do any of the Nunchaku quants support them?

I wonder if we need to disable the LoRA UI, or at least make sure there's an error of some kind explaining that LoRAs can't be used. Not sure what the timeline is for Nunchaku Qwen LoRA support; it would be really nice to have (also to only have one model plus a Lightning sampler which uses the LoRA).

@Danamir (Contributor Author) commented Oct 6, 2025

Another question: which models support LoRA? Do any of the Nunchaku quants support them?

I wonder if we need to disable the LoRA UI, or at least make sure there's an error of some kind explaining that LoRAs can't be used. Not sure what the timeline is for Nunchaku Qwen LoRA support; it would be really nice to have (also to only have one model plus a Lightning sampler which uses the LoRA).

I'm with you on this one, I really dislike having to use many models where a few LoRA would be enough.

The Nunchaku code to support LoRA loading is supposedly ready, cf. this comment: nunchaku-ai/ComfyUI-nunchaku#479 (comment). I don't know how long it will take for it to be available in ComfyUI.

@Danamir (Contributor Author) commented Oct 6, 2025

Question: do we really need to support both edit model variants?

Sadly the 2509 variant is seriously lacking in the styling department. It's a joke how bad it is, not even usable for this purpose. That's why I decided to keep both the recent and older architectures for now; otherwise we would lose a pretty nice feature.

As soon as they create a model capable of both, we can drop the older code! I'm pretty sure they won't go back to the simple TextEncodeQwenImageEdit node in future iterations, but who knows if they'll require a new node for each version...

I'd rather keep the backward compatibility for now, as long as it does not increase the technical debt too much.

@Danamir (Contributor Author) commented Oct 7, 2025

I just saw on another computer that the Qwen SVG icons require a local font to be displayed correctly; I wrongly thought Inkscape would vectorize the text. 😅

I'll let you create better ones, those were only placeholders anyway.

@Acly Acly merged commit ff40cf4 into Acly:main Oct 8, 2025
2 checks passed
@Acly (Owner) commented Oct 9, 2025

I made a small follow-up here: #2076

Have a look if you want. I moved the qwen-text-encode a bit further, to the same level as the regular clip-text-encode; hope I didn't break anything.

Thanks for your work!

@shadowlocked commented Oct 10, 2025

The PR has been tested with GGUF and Nunchaku versions. I do not have a non-quantized Qwen version, nor do I have a ComfyUI installation without the nunchaku package.

I just found out that the same behavior occurs in a vanilla install for the Chroma profile: if the user does not have Nunchaku installed and prioritized, the 'default' behavior throws an error, presumably because your own system defaults to Nunchaku. The non-Nunchaku user has to duplicate the affected profile and know to change the diffusion architecture from 'Automatic' to 'Chroma'.

Also the VAE choice has to be changed from 'automatic' (to, for instance, ae.safetensors).

I understand that you don't want to affect an installation that you clearly like, but for the sake of other users, could it not be forked into a version where Nunchaku is not the default? This would mean that the current issues for non-Nunchaku users trying to use Qwen, Chroma, etc. would not manifest.


@Acly (Owner) commented Oct 10, 2025

@shadowlocked I think your problem has nothing to do with Nunchaku. Chroma doesn't have a Nunchaku version at all. The built-in Flux/Qwen styles are set up to look for a number of potential models, including Nunchaku and non-Nunchaku variants, and take whatever they find.

Maybe it's just that you put the model in the wrong folder? It should be in diffusion_models. Or you have multiple conflicting VAE models?
